Exploratory Data Analysis of Airbnb Accommodation in Copenhagen
An Exam Project in Business Data Processing and Business Intelligence
- 1.0 ABSTRACT
- 2.0 INTRODUCTION
- 3.0 METHODOLOGY
- Install Libraries
- Gathering the Data
- Preprocessing
- EDA
- Visualizations
2.0 INTRODUCTION
Since 2008, Airbnb has grown from a small accommodation platform, based in San Francisco, into one that is recognised throughout the world. Airbnb has revolutionized the tourism housing industry by applying a sharing economy model to the accommodation business. Today, Airbnb has become the world’s largest accommodation service provider, with more accommodation options than any other accommodation business, and even more than all of them combined. As a platform, Airbnb enables people (hosts) to offer accommodation services to other people (guests), providing guests with a more unique and personalized way of experiencing the world, often at a lower price than other accommodation options. Airbnb captures only a fraction (20%) of each transaction, which in 2019 returned 4.7 billion USD in sales revenue.
2.1 PROBLEM FORMULATION AND RESEARCH QUESTION
Data plays a key role in Airbnb’s success. For instance, data enables Airbnb to match guests and hosts and allows users to filter the host listings to their liking by price, location, number of beds, and much more. Data is therefore essential to securing high customer satisfaction. Moreover, Airbnb can use the collected data to extract insights that improve its service offerings and guide decision making, marketing initiatives, and more.
As a platform, Airbnb’s value creation lies solely in creating successful matches between guests and hosts and in ensuring a positive experience for both parties. Naturally, if the platform fails to deliver a positive experience to a user, the user might abandon the platform altogether, resulting in negative feedback loops. This leads us to our research question:
How can Airbnb ensure matches and the experiences they create are positive for their customers (users), and providers (hosts)? Moreover, how can Airbnb help guide user decisions to create successful matches and positive experiences?
Currently, Airbnb helps users create meaningful matches by allowing guests to limit their search for accommodation by attributes of the individual host listing. As such, users can easily find accommodation that meets their basic needs, e.g. number of beds, bedrooms, price, and room type. However, without any knowledge of the different location areas, guests may find it difficult to choose a location that suits their needs.
In this project, we examine the accommodation services listed on Airbnb for Copenhagen, with particular regard to the location areas and the attributes associated with them. The goal is to create a report that can guide customers in choosing a location that lives up to their expectations, thereby improving the quality of the matches provided by the platform.
3.1 DATASET DESCRIPTION
The data was downloaded from the independent site Inside Airbnb, which scrapes data from Airbnb and makes it publicly available for analysis. The site provides a multitude of datasets containing information on the most populated cities around the world, including Copenhagen.
The datasets provided by Inside Airbnb are as follows: (1) listings, (2) calendar, (3) reviews, (4) listings_summary, (5) reviews_summary.
We downloaded and inspected all of the datasets; however, we assessed the calendar dataset to be unimportant for this project. Thus, the listings and reviews datasets were chosen for the analysis. Furthermore, we chose the most recently scraped data, dated 28 November 2020.
3.1.1 LISTINGS.CSV.GZ
The listings dataset contains data about the Airbnb host listings and their respective attributes. In total, there are 74 columns describing 8,636 listings on the Airbnb platform.
Here, we provide you with a short glimpse of some of the attributes in the listings dataset:
- id: primary key (listing_id)
- name: name of the listing
- neighbourhood_cleansed: location area
- latitude: latitude
- longitude: longitude
- beds: number of beds in the room
- bedrooms: number of bedrooms in the room
- price: price of the room per day
- room_type: type of room that is made available
- property_type: type of the property where the room is in
- review_scores_rating: average review score of the listing
3.1.2 REVIEWS.CSV.GZ
The reviews dataset contains data about reviews that were given for the listed accommodation services. In total, there are 6 columns describing 185,564 reviews.
Here, we provide you with a short glimpse of some of the attributes in the reviews dataset:
- listing_id: foreign key references listings
- id: primary key (review_id)
- date: date review was written
- reviewer_name: name of reviewer
- comments: review text
3.2 DATASET ANALYSIS PROCESS
For this project, we perform an exploratory data analysis of the data, visualize the processed data in an interactive map, and display word clouds of the review text.
Before we can start the analysis, we install the libraries that our Python interpreter needs to work with the data. Specifically, we use pandas to load and manage the data in pandas DataFrames, Plotly's Express library to visualize the data in an interactive map, and WordCloud together with Matplotlib's Pyplot to visualize word clouds of the review texts.
Once the necessary packages are installed and imported, we can gather the data and begin the data cleaning process. We use pandas to download the data from Inside Airbnb and to decompress it with its built-in gzip support.
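The decompression step can be tried without a network connection. The following is a minimal sketch, assuming a tiny made-up CSV (`raw_csv` is hypothetical and not real Inside Airbnb data) that is gzip-compressed in memory and read back with `pd.read_csv`:

```python
import gzip
import io

import pandas as pd

# Hypothetical miniature stand-in for listings.csv.gz (made-up rows).
raw_csv = (
    "id,name,price\n"
    '1,Cosy flat,"$1,200.00"\n'
    "2,Bright room,$450.00\n"
)
compressed = gzip.compress(raw_csv.encode("utf-8"))

# read_csv decompresses the gzip stream itself; no manual unzip step is needed.
sample = pd.read_csv(io.BytesIO(compressed), compression="gzip")
print(sample.shape)
print(sample.dtypes)
```

Note that the price column comes back as text because of the currency symbol and thousands separator, which is why it needs cleaning before any arithmetic.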
We then take a quick look at the datasets we gathered before cleaning them. To clean the data, we start by removing columns that are empty and listings that have never been reviewed, then rename columns so they are easier to interpret, and finally correct the data types.
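The cleaning steps can be sketched on a toy frame. This is only an illustration under stated assumptions: the values are made up, and `dropna(axis=1, how="all")` is one possible way to find empty columns automatically rather than naming them by hand.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the listings data, values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "license": [None, None, None],          # entirely empty column
    "number_of_reviews": [5, 0, 2],
    "first_review": ["2019-05-01", None, "2020-01-15"],
})

cleaned = (
    toy.dropna(axis=1, how="all")                        # drop columns without any values
       .loc[lambda d: d["number_of_reviews"] > 0]        # keep listings that were reviewed
       .rename(columns={"id": "listing_id", "name": "listing_name"})
       .assign(first_review=lambda d: pd.to_datetime(d["first_review"]))  # text -> datetime
)
print(cleaned.dtypes)
```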
We should now have clean data that we can use to analyze the attributes of the listings. Here, we investigate the distributions of prices, neighbourhoods, property types, and room types. Similarly, we investigate the average prices of listings by neighbourhood, property type, and room type. Finally, we aim to list the differences that occur between the categories.
Nearing the end, the two datasets are merged into one dataframe that contains data about both listings and reviews. They are joined on listing_id using an inner join (how='inner'). We can then use this dataframe to display the interactive map and the word clouds that summarize the reviews of each neighbourhood.
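To make the effect of the inner join concrete, here is a small sketch on made-up data (the frames `toy_listings` and `toy_reviews` are hypothetical; only the key column name follows the project):

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy_listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "neighbourhood_cleansed": ["Indre By", "Nørrebro", "Valby"],
})
toy_reviews = pd.DataFrame({
    "review_id": [10, 11, 12],
    "listing_id": [1, 1, 3],
    "review_text": ["Great stay", "Very central", "Quiet area"],
})

# how='inner' keeps only keys present in both frames:
# listing 2 has no review, so it does not appear in the result.
merged = toy_reviews.merge(toy_listings, on="listing_id", how="inner")
print(merged)
```

The result has one row per review, so the attributes of a listing are repeated for each of its reviews. This is worth keeping in mind when counting or averaging on the merged dataframe.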
Now, let's get started!
import pandas as pd
import requests
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
#Create DataFrames
listings = pd.read_csv('http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/2020-11-28/data/listings.csv.gz', compression='gzip')
reviews = pd.read_csv('http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/2020-11-28/data/reviews.csv.gz', compression='gzip')
listings.head()
listings.info()
listings.shape
reviews.head(20)
reviews.info()
reviews.shape
#Removing empty columns
listings.drop(columns=['neighbourhood_group_cleansed','bathrooms','calendar_updated','license'], inplace=True)
#Deselect Listings with no reviews
listings = listings[listings.number_of_reviews > 0]
#Rename columns
listings.rename(columns={'id':'listing_id','name':'listing_name','description':'listing_description'},inplace=True)
#Correct DataTypes
listings = listings.astype(
{
#DateTime:
'last_scraped':'datetime64[ns]',
'host_since':'datetime64[ns]',
'calendar_last_scraped':'datetime64[ns]',
'first_review':'datetime64[ns]',
'last_review':'datetime64[ns]'
}
)
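#Strip thousands separators and the '$' sign from price (values are treated as DKK), then cast to float
#regex=False treats ',' and '$' as literal characters ('$' is otherwise the regex end-of-string anchor)
listings.price = listings.price.str.replace(',','', regex=False)
listings.price = listings.price.str.replace('$','', regex=False)
listings.price = listings.price.astype(float)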
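The price conversion can be checked in isolation. A minimal sketch, assuming price strings in the same format as the raw column (the three values are made up):

```python
import pandas as pd

# Hypothetical price strings in the format of the raw price column.
raw_prices = pd.Series(["$1,200.00", "$450.00", "$89.00"])

# regex=False makes '$' a literal character rather than a regex anchor.
parsed = (
    raw_prices.str.replace(",", "", regex=False)
              .str.replace("$", "", regex=False)
              .astype(float)
)
print(parsed.tolist())
```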
#Rename Columns
reviews.rename(columns={'id':'review_id','date':'review_date','comments':'review_text'},inplace=True)
#Change DataTypes
reviews.review_date = pd.to_datetime(reviews.review_date)
neighbourhoods = listings['neighbourhood_cleansed'].value_counts().to_frame(name='listings').reset_index()
neighbourhoods
listings[['property_type']].value_counts().to_frame(name='listings').reset_index()
listings[['room_type']].value_counts().to_frame(name='listings').reset_index()
#Select properties listed more than 400 times
listings_clean = listings[listings.property_type.isin(['Entire apartment','Private room in apartment','Entire condominium','Entire house'])]
#Count number of listings in neighbourhoods by property type
listings_byNeighbourhood = listings_clean.groupby(['neighbourhood_cleansed','property_type']).neighbourhood_cleansed.count().to_frame(name = 'listings').reset_index()
#Sum number of listings per neighbourhood
listingsNeighbourhoodCount = listings_byNeighbourhood.groupby('neighbourhood_cleansed')['listings'].sum().to_frame(name = 'total_listings').sort_values(by='total_listings', ascending=False).reset_index()
#Calculate ratio of property types in the different neighbourhoods
neighbourhoodPropertyRatio = listings_byNeighbourhood.merge(listingsNeighbourhoodCount, on='neighbourhood_cleansed')
neighbourhoodPropertyRatio['ratio_of_property_type_in_neighbourhood'] = neighbourhoodPropertyRatio['listings']/neighbourhoodPropertyRatio['total_listings']*100
neighbourhoodPropertyRatio.head(10)
#Count number of listings in neighbourhoods by room type
roomCount = listings.groupby(['neighbourhood_cleansed','room_type']).neighbourhood_cleansed.count().to_frame(name = 'listings').reset_index()
#Sum number of listings per neighbourhood
roomNeighbourhoodCount = roomCount.groupby('neighbourhood_cleansed')['listings'].sum().to_frame(name = 'total_listings').sort_values(by='total_listings', ascending=False).reset_index()
#Calculate ratio of room types in the different neighbourhoods (using the totals computed from roomCount)
roomRatio = roomCount.merge(roomNeighbourhoodCount, on='neighbourhood_cleansed')
roomRatio['ratio_of_room_type_in_neighbourhood'] = roomRatio['listings']/roomRatio['total_listings']*100
roomRatio.head(10)
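As a cross-check on ratios like these, the same shares can be computed in one step with `pd.crosstab`. This is an alternative sketch on made-up data, not the method used above:

```python
import pandas as pd

# Hypothetical toy listings for illustration.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Indre By", "Indre By", "Indre By", "Valby", "Valby"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room"],
})

# normalize='index' divides each row by its row total,
# so the shares of every neighbourhood sum to 100.
share = pd.crosstab(toy["neighbourhood_cleansed"], toy["room_type"], normalize="index") * 100
print(share.round(1))
```

If the shares of a neighbourhood do not sum to 100, the numerator and the total were most likely computed from different subsets of the listings.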
df = pd.DataFrame()
df['avg_n_accommodations'] =listings.groupby('neighbourhood_cleansed').accommodates.mean()
df = df.reset_index()
df
neighbourhoodPricing = listings.groupby('neighbourhood_cleansed').price.mean().to_frame().sort_values(by='price', ascending=False).reset_index()
neighbourhoodPricing
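#Calculate Average Price Per Person
#Merge on neighbourhood so rows align by name; neighbourhoodPricing is sorted by price, df is not
df = df.merge(neighbourhoodPricing, on='neighbourhood_cleansed')
df['price_perPerson'] = df.price/df.avg_n_accommodations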
df
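Because two summary tables can be sorted differently, joining them on the neighbourhood name is a safe way to line up their values. A minimal sketch on made-up data (the frame `toy` and its numbers are hypothetical):

```python
import pandas as pd

# Hypothetical toy listings for illustration.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Amager Vest", "Amager Vest", "Indre By", "Indre By"],
    "price": [600.0, 800.0, 1500.0, 900.0],
    "accommodates": [2, 4, 4, 2],
})

# Sorted alphabetically by neighbourhood (groupby default).
avg_guests = (toy.groupby("neighbourhood_cleansed")["accommodates"].mean()
                 .rename("avg_n_accommodations").reset_index())
# Sorted by price, so the row order differs from avg_guests.
avg_price = (toy.groupby("neighbourhood_cleansed")["price"].mean()
                .sort_values(ascending=False).reset_index())

# Merging on the key aligns rows by neighbourhood, whatever the sort order.
per_person = avg_guests.merge(avg_price, on="neighbourhood_cleansed")
per_person["price_perPerson"] = per_person["price"] / per_person["avg_n_accommodations"]
print(per_person)
```

Note that this divides the average price by the average capacity per neighbourhood, which is not the same as averaging the per-listing price per person.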
PropertyPricing = listings.groupby('property_type').price.mean().to_frame().sort_values(by='price', ascending=False).reset_index()
PropertyPricing.head(20)
roomPricing = listings.groupby('room_type').price.mean().to_frame().sort_values(by='price', ascending=False).reset_index()
roomPricing.head(20)
#Merge reviews and listings
group_listingReviews = reviews.merge(listings, on='listing_id', how='inner')
#Define mapbox API token and style
mapbox_access_token = 'pk.eyJ1IjoiYWNodG9uMjExMSIsImEiOiJja2lyam5yemgyNTV0MnJsYmJ0NXdzNWRxIn0.rWJgur27hJnWoBt7Oq5LeQ'
px.set_mapbox_access_token(mapbox_access_token)
plot_style = 'mapbox://styles/achton2111/ckirsv5df0aj01at4zp0d7f3w'
#Interactive Geospatial plot
fig = px.scatter_mapbox(group_listingReviews,
lat="latitude",
lon="longitude",
color="neighbourhood_cleansed",
zoom=10,
size='price',
mapbox_style= plot_style,
hover_name='listing_name',
hover_data = ['price',
'property_type',
'room_type',
'accommodates',
'beds',
'review_scores_rating'],
opacity = 0.8,
title = 'Airbnb Listing Locations (Coloured by Neighbourhood, Sized by Price)'
)
fig.show()
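The word clouds described in section 3.2 rest on word frequencies per neighbourhood. As a rough sketch of that idea, using only the standard library, the counts could be built as follows. The sample reviews and the stop-word list are made up; in the project, the pairs would come from the neighbourhood_cleansed and review_text columns of the merged dataframe. The resulting per-neighbourhood dictionaries have the shape that WordCloud's generate_from_frequencies method accepts.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (neighbourhood, review_text) pairs standing in for the merged dataframe.
sample_reviews = [
    ("Indre By", "Great location, very central and great host"),
    ("Indre By", "Central apartment, a bit noisy"),
    ("Valby", "Quiet area and a very friendly host"),
]
stop_words = {"a", "and", "very", "bit"}  # illustrative stop-word list

# One Counter of word frequencies per neighbourhood.
frequencies = defaultdict(Counter)
for neighbourhood, text in sample_reviews:
    words = re.findall(r"[a-z']+", text.lower())
    frequencies[neighbourhood].update(w for w in words if w not in stop_words)

for neighbourhood, counts in frequencies.items():
    print(neighbourhood, counts.most_common(3))
```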